Pet Paradise: a data-driven vision of the growing dog population around Lake Zurich. AI-generated by our team.
We introduce Pet Paradise, a successful Zürich-based pet shop that wants to expand. However, the owner is unsure of current and future customer needs, and has approached us for predictions about canine breeds, ages, and neighborhood concentrations and distributions, in order to decide where to open the next Pet Paradise branch and which products to offer there.
The objective of this report is to advise Pet Paradise by predicting dog breed trends, in order to facilitate targeted marketing. Our analysis examines canine and dog-owner data across Zürich's neighborhoods to identify prevailing canine demographics and trends. Our predictive analysis extends beyond current demographics to anticipate future trends, allowing Pet Paradise to stay ahead of evolving customer needs.
By monitoring shifts in dog ownership patterns, breed popularity, and lifestyle preferences (size, number of dogs), Pet Paradise can adapt its product offerings and marketing strategies.
With our help, Pet Paradise can leverage data-driven insights to grow its business and help Zürich's dogs live their healthiest, happiest lives. If it fosters this level of customer satisfaction, we envision Pet Paradise expanding into other cantons and shaping Switzerland's pet industry landscape.
Dogs are an integral part of urban communities, with pet ownership having grown in parallel with the population over the past decades. Zürich and its canton boast the largest dog population of any Swiss region, as suggested by a 2013 study (Pospischil et al. 2013), underscoring the value of a thorough, data-driven interpretation of the markets associated with dog ownership.
Additionally, as documented in a report by the Zürich City Police (Statistik Stadt Zürich 1984), historical concerns about the environmental impact of an ever-increasing canine population have led to stricter laws against dog waste and dedicated taxes for dog owners. Dog registration procedures, which date back several hundred years, have facilitated the collection of valuable statistical data that offers a glimpse into the relationship between owners and their pets across time (with limits on owners' personal information due to privacy concerns). The existence of such cohesive and readily available data was a strong motivating factor for our team to undertake this project.
Moreover, our data science team consists of individuals who are deeply passionate about dogs, each with varying degrees of personal experience in pet ownership, and who understand the importance of analyzing the current dynamics between humans and dogs in Switzerland from an analytic perspective.
The analysis in this report is conducted purely for educational purposes, focusing solely on statistical modeling and client recommendations. Within the confines of this project, the sex values recorded for dogs and their human owners in the dataset are interpreted as binary. The findings presented here are not intended to reflect or endorse any particular social position on sex or gender identity; we acknowledge that such discussions are multifaceted for individuals and communities.
As mentioned above, the main data has been sourced from the dog register (Stadt Zürich 2024), having been collected and published by the Open Data Portal of the City Council of Zürich, under the name “Hundebestände der Stadt Zürich, seit 2015”. The description of the data set from the original source is as follows:
This dataset contains information on dogs and their owners from the municipal dog register since 2015. Information on the age group, sex and statistical district of residence is provided for dog owners. The breed, breed type, sex, year of birth, age and color are recorded for each dog. The dog register is kept by the Dog Control Department of the Zurich City Police.
To ensure a seamless workflow and make variable interpretation easier for our group, we have undertaken several preparatory steps with the dataset. These include renaming columns and translating certain string values from German to English, along with performing some cleaning procedures.
The main source of data is the kul100od1001.csv file,
which contains a collection of 70,967 dog registrations with 33
variables.
For the English version, translations for the column names are defined, and a helper function replaces multiple patterns at once for content translation, covering age groups, sexes, breed types, and dog colors. From this point forward, we refer to variables and items exclusively by their translated English names.
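As a minimal sketch of this step (the dictionary below is illustrative, not the full translation table used in the report), a named vector maps German terms to English and a small helper replaces every matching pattern in a character column:

```r
# Replace each German pattern in x with its English translation.
# dict is a named character vector: names are patterns, values are replacements.
translate <- function(x, dict) {
  for (pattern in names(dict)) {
    x <- gsub(pattern, dict[[pattern]], x, fixed = TRUE)
  }
  x
}

# Toy dictionary for dog colors (hypothetical subset)
color_dict <- c("schwarz" = "black", "braun" = "brown", "gestromt" = "brindle")
dog_colors <- c("schwarz/braun", "gestromt", "braun")
translate(dog_colors, color_dict)
# "black/brown" "brindle" "brown"
```

Applying such a function column by column yields the fully translated dataframe used below.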
The next step involves identifying and marking the initial occurrence
of each OwnerId as unique within the dataset. This
distinction facilitates further analyses that may require the
identification of distinct entries.
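This marking step can be sketched with base R's `duplicated()`, whose negation flags exactly the first appearance of each owner (toy IDs below):

```r
# duplicated() returns TRUE for every repeat of a value, so !duplicated()
# marks the first occurrence of each OwnerId as unique.
df <- data.frame(OwnerId = c(126, 574, 126, 893, 574))
df$unique_OwnerId <- !duplicated(df$OwnerId)
df$unique_OwnerId
# TRUE TRUE FALSE TRUE FALSE
```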
Finally, a subset of relevant columns is extracted from the
comprehensive dataset, creating a streamlined dataframe named
df_EN_EDA. The subset includes essential fields such as
KeyDateYear, OwnerId, and details regarding
the dogs, including PrimaryBreed and
DogBirthYear. Additionally, the NumberOfDogs
column is converted from its original format to a numeric type, ensuring
that subsequent data analysis can utilize numerical operations.
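The subsetting and type conversion can be sketched as follows (toy rows stand in for the full 33-column dataset; the real `keep` list is longer):

```r
# Toy stand-in for the translated dataframe
df_EN <- data.frame(
  KeyDateYear  = c(2015, 2015, 2016),
  OwnerId      = c(126, 574, 126),
  PrimaryBreed = c("Welsh Terrier", "Cairn Terrier", "Welsh Terrier"),
  DogBirthYear = c(2011, 2002, 2011),
  NumberOfDogs = c("1", "1", "2"),   # stored as text in the raw export
  stringsAsFactors = FALSE
)

# Keep only the columns needed for exploratory analysis
keep <- c("KeyDateYear", "OwnerId", "PrimaryBreed", "DogBirthYear", "NumberOfDogs")
df_EN_EDA <- df_EN[, keep]

# Convert the dog count to numeric so it supports arithmetic
df_EN_EDA$NumberOfDogs <- as.numeric(df_EN_EDA$NumberOfDogs)
```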
We begin with a summary of the data, providing an overview of its structure and contents.
# Inspect dataset structure
str(df_EN)
## 'data.frame': 70967 obs. of 33 variables:
## $ KeyDateYear : int 2015 2015 2015 2015 2015 2015 2015 2015 2015 2015 ...
## $ DataStatusCd : chr "D" "D" "D" "D" ...
## $ OwnerId : int 126 574 695 893 1177 4004 4050 4155 4203 4215 ...
## $ OwnerAgeGroupCd : int 60 60 40 60 50 60 40 60 50 40 ...
## $ OwnerAgeGroup : chr "60 to 69 years old" "60 to 69 years old" "40 to 49 years old" "60 to 69 years old" ...
## $ OwnerAgeGroupSort: int 7 7 5 7 6 7 5 7 6 5 ...
## $ OwnerSexCd : int 1 2 1 2 1 2 2 2 2 2 ...
## $ OwnerSex : chr "male" "female" "male" "female" ...
## $ OwnerSexSort : int 1 2 1 2 1 2 2 2 2 2 ...
## $ DistrictCd : int 9 2 6 7 10 3 11 9 2 8 ...
## $ District : chr "Kreis 9" "Kreis 2" "Kreis 6" "Kreis 7" ...
## $ DistrictSort : int 9 2 6 7 10 3 11 9 2 8 ...
## $ QuarCd : int 92 23 63 71 102 34 111 92 21 81 ...
## $ Quar : chr "Altstetten" "Leimbach" "Oberstrass" "Fluntern" ...
## $ QuarSort : int 92 23 63 71 102 34 111 92 21 81 ...
## $ PrimaryBreed : chr "Welsh Terrier" "Cairn Terrier" "Labrador Retriever" "Mittelschnauzer" ...
## $ SecondaryBreed : chr "none" "none" "none" "none" ...
## $ MixedBreedCd : int 1 1 1 1 1 1 1 1 2 3 ...
## $ MixedBreed : chr "Pedigree dog" "Pedigree dog" "Pedigree dog" "Pedigree dog" ...
## $ MixedBreedSort : int 1 1 1 1 1 1 1 1 2 3 ...
## $ BreedTypeCd : chr "K" "K" "I" "I" ...
## $ BreedType : chr "Small stature" "Small stature" "Breed type list I" "Breed type list I" ...
## $ BreedTypeSort : int 1 1 2 2 1 1 1 1 2 2 ...
## $ DogBirthYear : int 2011 2002 2012 2010 2011 2010 2012 2002 2005 2001 ...
## $ DogAgeGroupCd : int 3 12 2 4 3 4 2 12 9 13 ...
## $ DogAgeGroup : chr "3 years old" "12 years old" "2 years old" "4 years old" ...
## $ DogAgeGroupSort : int 3 12 2 4 3 4 2 12 9 13 ...
## $ DogSexCd : int 2 2 2 2 1 1 1 1 2 2 ...
## $ DogSex : chr "female" "female" "female" "female" ...
## $ DogSexSort : int 2 2 2 2 1 1 1 1 2 2 ...
## $ DogColor : chr "black/brown" "brindle" "brown" "black" ...
## $ NumberOfDogs : int 1 1 1 1 1 1 1 1 1 1 ...
## $ unique_OwnerId : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
As can be seen in the structure of the data, the set comprises observations of diverse data types. Most variables appear three times in different forms: as integers (a coded form and a sort form) and as strings (text). Depending on their role in the study, one of the three variants was selected for each; our selection of relevant variables can be summarized as follows:
Numerical values:

- KeyDateYear: numerical value for the reference year
- OwnerId: numerical identifier for the owner of the registered dog
- OwnerAgeGroup: referring to the owner's age as a 10-year category
- DogBirthYear: numerical value for the birth year of the dog
- DogAgeGroupSort: referring to the dog's age at the time of registration
- NumberOfDogs: numerical counter of the dog count for each dog owner

Binary variables:

- DogSexCd: numerical value indicating two states (1 = male, 2 = female) for the biological sex of the dog

String values:

- District: the name of each larger district of Zürich according to the official division
- Quar: the name of the smaller neighborhoods which comprise the larger districts
- PrimaryBreed and SecondaryBreed: referring to dog breed denominations and information
- MixedBreed: additional information regarding breed mixing in the dog
- DogColor: a descriptive name for the color of the dog
- BreedType: referring to the official dog type classification according to the dog ordinance (Regierungsrat 2009)

This series of R code snippets examines key features within the df_EN_EDA dataframe, focusing on the identification and analysis of unique entries for KeyDateYear, OwnerId, and OwnerAgeGroup. Each code section extracts unique values, counts these entries, and, where applicable, visualizes the distribution. Such analysis is integral for understanding the dataset's diversity across different dimensions, highlighting temporal coverage, ownership uniqueness, and demographic variations among owners.
# Extract and count unique years
unique_years <- unique(df_EN_EDA$KeyDateYear)
number_of_unique_years <- length(unique_years)
print(number_of_unique_years)
## [1] 9
print(unique_years)
## [1] 2015 2016 2017 2018 2019 2020 2021 2022 2023
# Extract and count unique Owner IDs
unique_Owner <- unique(df_EN_EDA$OwnerId)
number_of_unique_Owner <- length(unique_Owner)
print(number_of_unique_Owner)
## [1] 15504
This section presents an interactive visualization that displays unique owner IDs by age group and sex for a selected year. The user interface allows the selection of a year and a sex, and the resulting plot shows the distribution of unique Owner IDs across different age groups based on the chosen criteria.
The following visualization shows the count of unique owner IDs across different age groups over the years. The plot is generated by aggregating unique owner IDs by age group and year, adjusting factor levels, and creating a line plot.
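The aggregation behind that plot can be sketched in base R (the report's plot itself is drawn with ggplot2; column names follow the report, toy rows are illustrative):

```r
# Count distinct owners per (year, age group) combination
df <- data.frame(
  KeyDateYear   = c(2015, 2015, 2015, 2016, 2016),
  OwnerAgeGroup = c("40 to 49", "40 to 49", "60 to 69", "40 to 49", "60 to 69"),
  OwnerId       = c(1, 1, 2, 1, 2)
)
counts <- aggregate(OwnerId ~ KeyDateYear + OwnerAgeGroup, data = df,
                    FUN = function(ids) length(unique(ids)))
# counts now holds one row per year/age-group pair, ready for a line plot
```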
After confirming successful conversion, we aggregated the data to compute the total number of dogs per year. The resulting counts were then visualized using histograms to illustrate the distribution over the years.
Furthermore, to understand the trend in dog population over time, we calculated the percentage change between consecutive years. This allowed us to identify any notable fluctuations or patterns in the data.
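The year-over-year percentage change can be computed as a sketch like this (toy totals, not the report's actual counts):

```r
# Percentage change between consecutive yearly totals;
# the first year has no predecessor, so its change is NA.
totals <- data.frame(Year = 2015:2018, Dogs = c(100, 110, 121, 133))
totals$PctChange <- c(NA, diff(totals$Dogs) / head(totals$Dogs, -1) * 100)
```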
## [1] 0
A heatmap was created to illustrate the distribution of dogs based on the sex and age group of their owners across different years. The data was grouped by year, owner's age group, and owner's sex, with a separate heatmap generated for each year. Each heatmap shows the total number of dogs across owner age groups, categorized by owner sex; the color gradient indicates the intensity of dog ownership, with warmer colors representing higher dog counts.
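The cross-tabulation that feeds one year's heatmap can be sketched with `xtabs()` (toy counts; the report renders the matrix with a ggplot2 tile layer):

```r
# Build a (age group x sex) matrix of dog counts for a single year
df <- data.frame(
  OwnerAgeGroup = c("40-49", "40-49", "60-69", "60-69"),
  OwnerSex      = c("male", "female", "male", "female"),
  NumberOfDogs  = c(3, 5, 2, 4)
)
heat <- xtabs(NumberOfDogs ~ OwnerAgeGroup + OwnerSex, data = df)
# heat can now be mapped to a color gradient (e.g. image() or geom_tile())
```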
The annual distribution of dog registrations across various districts is examined using an interactive Shiny application. Users can select a year to view the total count of dogs by district for that specific year. The data is processed and visualized to provide insights into the distribution patterns over time.
To enhance understanding of the distribution of dogs across districts and introduce a sex perspective into the analysis, the approach was extended to include a breakdown by owner sex. This allows observation not only of the geographical distribution but also of sex dynamics within the dog population each year. The accompanying Shiny application lets users select a year and the owner's sex to explore the distribution of dogs across districts, facilitating insights into demographic and geographic trends in dog ownership.
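Stripped of the Shiny widgets, the app's server logic reduces to a filter-and-aggregate step like the following sketch (names are illustrative; in the app, `year` and `sex` come from `input$` controls and the result feeds a bar chart):

```r
# Total dogs per district for a chosen year and owner sex
dogs_by_district <- function(df, year, sex) {
  sub <- df[df$KeyDateYear == year & df$OwnerSex == sex, ]
  aggregate(NumberOfDogs ~ District, data = sub, FUN = sum)
}

df <- data.frame(
  KeyDateYear  = c(2020, 2020, 2020, 2021),
  OwnerSex     = c("female", "female", "male", "female"),
  District     = c("Kreis 1", "Kreis 2", "Kreis 1", "Kreis 1"),
  NumberOfDogs = c(2, 3, 1, 4)
)
dogs_by_district(df, 2020, "female")
```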
To deepen the analysis of dog populations across different districts annually, the R script incorporates an additional layer of granularity by assessing dog counts not only by district but also by breed type. This enhancement provides a more detailed view of the diversity within the canine populations across various districts each year.
In an effort to provide a more comprehensive analysis of dog populations within the districts, the latest R script includes not only total counts by district but also a detailed breakdown by breed type and owner sex. This aims to offer a deeper understanding of the diversity and demographics of canine registrations across regions.
This Shiny application provides a clear and dynamic way to visualize the most popular dog breeds each year, allowing for the inclusion or exclusion of unknown breeds. The color-coding of breeds enhances readability and helps in quickly identifying trends.
This visualization enables users to explore the distribution of dog breeds across different districts for a selected year. It highlights the top 5 dog breeds in each district, allowing for a detailed understanding of breed trends and distribution patterns.
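The ranking behind this view can be sketched in base R (toy data; `n` is set to 2 here instead of the report's 5 to keep the example small):

```r
# Top-n breeds by dog count within each district
top_breeds <- function(df, n = 2) {
  counts <- aggregate(NumberOfDogs ~ District + PrimaryBreed, data = df, FUN = sum)
  do.call(rbind, lapply(split(counts, counts$District), function(g) {
    head(g[order(-g$NumberOfDogs), ], n)   # largest counts first
  }))
}

df <- data.frame(
  District     = c("K1", "K1", "K1", "K2"),
  PrimaryBreed = c("Chihuahua", "Labrador", "Pug", "Chihuahua"),
  NumberOfDogs = c(5, 3, 1, 2)
)
top_breeds(df, 2)
```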
In the first chapter of our machine learning models we begin with a linear model that tests the effect of a categorical variable. We ask the following question: does breed status have an effect on the age at which dogs are registered?
Understanding typical registration ages for different breeds will allow Pet Paradise to target marketing effectively, reaching owners at the right stage of their pet ownership journey. Additionally, these insights can inform strategic inventory management, anticipating demand for breed-specific products and offering tailored advice to enhance customer satisfaction.
In the context of this question, linear models will not be employed to generate predictions from the data. Rather, they are used to determine whether there is an association between the levels of the categorical variable (pedigree dog, non-pedigree, etc.) and the response variable, the age of the dog.
To answer this question we will consider the
DogAgeGroupSort as the response variable, and the different
levels of the categorical variable MixedBreed as
predictors. We now direct our attention to the following set of boxplots
showcasing the relevant variables.
Based on the boxplots above, there does seem to be a difference in dog ages based on their breed status. Interestingly, the most extreme outliers are associated with pedigree dogs. We will continue by defining a linear model.
lm.dogs.1 <- lm(DogAgeGroupSort ~ MixedBreed, data = df_EN_cleaned)
summary(lm.dogs.1)
##
## Call:
## lm(formula = DogAgeGroupSort ~ MixedBreed, data = df_EN_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.4917 -3.7002 -0.7002 3.2998 17.2998
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.70021 0.01839 310.011 < 2e-16 ***
## MixedBreedMixed breed, both breeds known -0.37787 0.05838 -6.473 9.68e-11 ***
## MixedBreedMixed breed, both breeds unknown 0.58939 0.04715 12.501 < 2e-16 ***
## MixedBreedMixed breed, secondary breed unknown 1.79145 0.05841 30.670 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.14 on 70955 degrees of freedom
## Multiple R-squared: 0.01569, Adjusted R-squared: 0.01565
## F-statistic: 376.9 on 3 and 70955 DF, p-value: < 2.2e-16
The intercept corresponds to the pure-breed (pedigree) dogs, while the remaining coefficients represent each category's difference from that baseline. The model summary provides strong evidence that the mean age of pedigree dogs at registration is about 5.7 years, and that the ages of the other three breed categories differ significantly from that of pedigree dogs. The most noticeable difference is between pedigree dogs and those whose secondary breed is unknown, the latter being on average 1.79 years older.
We follow up this insight by assessing the differences between each of them.
drop1 <- drop1(lm.dogs.1, test = "F")
drop1
## Single term deletions
##
## Model:
## DogAgeGroupSort ~ MixedBreed
##            Df Sum of Sq     RSS    AIC F value    Pr(>F)
## <none>                  1216314 201636
## MixedBreed  3     19384 1235698 202752  376.93 < 2.2e-16 ***
By performing single-term deletion and evaluating the resulting statistics, we find that breed status as a whole has a significant effect on dog age (F = 376.93, p < 2.2e-16). This test, however, does not tell us which individual levels differ from one another.
We can further this insight with a general linear hypothesis test, considering all possible pairwise comparisons via Tukey's Honest Significant Difference test.
In the Tukey Honest Significant Difference test above, the difference between the means of each pair is shown, with values close to 0 indicating no difference. This, along with the 95% confidence intervals, provides an illustrative view of the pairwise variations.
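As a base-R sketch of the same pairwise comparison (the report applies it via a general linear hypothesis on lm.dogs.1; here toy data and `aov()` + `TukeyHSD()` stand in, which perform the same Tukey HSD adjustment):

```r
# Toy data: three breed categories with different mean ages
set.seed(1)
toy <- data.frame(
  age   = c(rnorm(30, 5), rnorm(30, 6), rnorm(30, 7)),
  breed = factor(rep(c("Pedigree", "MixKnown", "MixUnknown"), each = 30))
)

# All pairwise mean differences with Tukey-adjusted 95% intervals
tukey <- TukeyHSD(aov(age ~ breed, data = toy))
tukey$breed   # one row per pair: difference, lower/upper CI, adjusted p
```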
As an additional implementation of linear models, we now aim at answering the following research question: how do dog counts evolve over time? To do so we will build a linear model that will provide some information about the trends in the time series of registered dog count data over the time period recorded, as well as some predictions for the following 10 years.
##
## Call:
## lm(formula = TotalDogs ~ DistrictSort + KeyDateYear, data = annual_dog_counts)
##
## Residuals:
## Min 1Q Median 3Q Max
## -132.183 -29.459 -8.875 27.640 143.831
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -53518.004 3760.642 -14.231 < 2e-16 ***
## DistrictSort2 617.667 24.174 25.551 < 2e-16 ***
## DistrictSort3 585.889 24.174 24.236 < 2e-16 ***
## DistrictSort4 286.444 24.174 11.849 < 2e-16 ***
## DistrictSort5 130.778 24.174 5.410 4.54e-07 ***
## DistrictSort6 415.889 24.174 17.204 < 2e-16 ***
## DistrictSort7 974.111 24.174 40.296 < 2e-16 ***
## DistrictSort8 292.222 24.174 12.088 < 2e-16 ***
## DistrictSort9 853.556 24.174 35.309 < 2e-16 ***
## DistrictSort10 602.778 24.174 24.935 < 2e-16 ***
## DistrictSort11 1202.222 24.174 49.732 < 2e-16 ***
## DistrictSort12 430.444 24.174 17.806 < 2e-16 ***
## DistrictSort15 -98.763 34.238 -2.885 0.00483 **
## KeyDateYear 26.570 1.863 14.265 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 51.28 on 97 degrees of freedom
## Multiple R-squared: 0.982, Adjusted R-squared: 0.9796
## F-statistic: 407.7 on 13 and 97 DF, p-value: < 2.2e-16
The coefficients for each district, represented by the DistrictSort variables, show their effect on the total dog count relative to a baseline district. All districts exhibit highly significant effects on dog populations, either positive or negative, at p < 0.01 (most at p < 0.001). The annual increase in dog counts, indicated by the KeyDateYear coefficient of 26.57, reflects a significant positive trend over time (p < 0.001). This consistent annual increase underscores the growing dog population.
The residuals, i.e. the distribution of prediction errors, range from -132.183 to 143.831. This range shows some variability in the model's errors, yet the overall predictions are reliable. The adjusted R-squared of about 0.98 is very good: roughly 98% of the variation in the data is explained by the model.
The analysis thus reveals a positive trend over time, suggesting a growing market for dog-related products and services; Pet Paradise can expect sustained and potentially increasing demand in its sector.
To capitalize on this growth, strategic recommendations focus on the districts with the highest positive coefficients: District 11 has the highest coefficient (1202), indicating the largest dog counts relative to the baseline district. District 7, known for its expensive neighborhoods around Zürichberg, follows with a coefficient of 974, and District 9 also shows a substantial effect with a coefficient of 853.
We constructed a Poisson Generalized Linear Model (GLM) to estimate
the number of dog registrations (NumberOfDogs) based on the
predictors KeyDateYear and DistrictCd. The
goal was to identify trends and distributions in dog ownership across
Zurich’s neighborhoods to support Pet Paradise’s expansion strategy.
first_poisson_model <- glm(NumberOfDogs ~ KeyDateYear + DistrictCd,
family = poisson,
data = df_EN)
summary(first_poisson_model)
##
## Call:
## glm(formula = NumberOfDogs ~ KeyDateYear + DistrictCd, family = poisson,
## data = df_EN)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6227053 2.9106351 0.214 0.831
## KeyDateYear -0.0003063 0.0014414 -0.213 0.832
## DistrictCd -0.0001025 0.0011205 -0.091 0.927
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 191.58 on 70966 degrees of freedom
## Residual deviance: 191.53 on 70964 degrees of freedom
## AIC: 142281
##
## Number of Fisher Scoring iterations: 4
cv_model
## Generalized Linear Model
##
## 70967 samples
## 2 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 56774, 56774, 56773, 56774, 56773
## Resampling results:
##
## RMSE Rsquared MAE
## 0.05929555 0.0002289192 0.006880633
We now interpret the results from the Generalized Linear Poisson Model:
KeyDateYear: The coefficient for KeyDateYear is -0.0003 with a p-value of 0.832, so there is no significant trend across the years in the number of dog registrations. Hence, year-to-year shifts in registrations are not a primary factor for Pet Paradise's planning.
DistrictCd: The coefficient for the district is -0.0001 with a p-value of 0.927, indicating no significant difference in dog registrations across districts. This suggests that district-specific variations in dog registrations might not be substantial.
In terms of client recommendations, geographical expansion can be addressed. Since neither KeyDateYear nor DistrictCd significantly impacts the number of dog registrations in this model, dog ownership appears stable across years and districts. Pet Paradise can therefore consider expanding uniformly across districts rather than focusing on specific areas with presumed higher dog populations.
In conclusion, this Poisson GLM analysis shows stable registration numbers across Zürich; Pet Paradise should leverage this stability and base its expansion on other factors.
Model Deviance and AIC:
With a null deviance of 191.58 on 70,966 degrees of freedom, the residual deviance of 191.53 on 70,964 degrees of freedom shows only a minimal reduction, which means that the predictors in the model do not significantly improve the fit compared to the null model.
The high AIC value indicates that, relative to the information it provides, the model is quite complex. Convergence in four Fisher scoring iterations suggests the parameters converged quickly, as expected for GLMs with well-behaved data.
Regarding the cross-validation, we used 5-fold validation, i.e. breaking the data into five subsets, training on four, and testing on the fifth. The low root mean square error of 0.0593 shows that the model's predictions are numerically close to the actual values. The near-zero R-squared (0.0002), however, shows that the predictors (KeyDateYear and DistrictCd) explain almost none of the variability in the number of dog registrations, consistent with the insignificant coefficients observed.
In terms of client insights, we fail to reject the null hypothesis here; the stability in the number of dog registrations across years and districts suggests that Pet Paradise can plan for uniform expansion without focusing on specific districts. Other factors, such as the owner's age or whether they own a pedigree dog, may be more influential in determining demand for pet services; we examine these in further models.
The above visualization shows a poor fit: if the model were performing well, we would see a more diagonal spread of points, indicating a clear linear relationship between actual and predicted values. The chart does not show this pattern, suggesting the model is not capturing the variability in the data. Upon reflection, the model was refined by aggregating, i.e. summing, the number of dogs per year.
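The refinement can be sketched as aggregating the per-registration rows into totals per year and district, then refitting the Poisson GLM on the sums (toy rows stand in for df_EN; names follow the report):

```r
# One row per registration in the raw data, so NumberOfDogs is 1 per row
df <- data.frame(
  KeyDateYear  = c(2015, 2015, 2015, 2016, 2016, 2017),
  DistrictCd   = c(1, 1, 2, 1, 2, 2),
  NumberOfDogs = 1
)

# Aggregate to total dogs per (year, district)
sum_data <- aggregate(NumberOfDogs ~ KeyDateYear + DistrictCd, data = df, FUN = sum)

# Refit the Poisson model on the aggregated counts
second_poisson_model <- glm(NumberOfDogs ~ KeyDateYear + DistrictCd,
                            family = poisson, data = sum_data)
coef(second_poisson_model)
```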
summary(second_poisson_model)
##
## Call:
## glm(formula = NumberOfDogs ~ KeyDateYear + DistrictCd, family = poisson,
## data = sum_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -81.666277 2.925105 -27.92 <2e-16 ***
## KeyDateYear 0.043686 0.001448 30.16 <2e-16 ***
## DistrictCd -0.009640 0.000368 -26.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 24077 on 110 degrees of freedom
## Residual deviance: 22173 on 108 degrees of freedom
## AIC: 23067
##
## Number of Fisher Scoring iterations: 11
## `geom_smooth()` using formula = 'y ~ x'
To check which Poisson model is better, we compare goodness-of-fit. AIC can be used here because the models are built from the same underlying dataset, share the same response variable, and use the same Poisson distribution. We also compare coefficients.
summary_first <- summary(first_poisson_model)
summary_first
##
## Call:
## glm(formula = NumberOfDogs ~ KeyDateYear + DistrictCd, family = poisson,
## data = df_EN)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6227053 2.9106351 0.214 0.831
## KeyDateYear -0.0003063 0.0014414 -0.213 0.832
## DistrictCd -0.0001025 0.0011205 -0.091 0.927
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 191.58 on 70966 degrees of freedom
## Residual deviance: 191.53 on 70964 degrees of freedom
## AIC: 142281
##
## Number of Fisher Scoring iterations: 4
summary_first$deviance[1]
## [1] 191.5258
summary_first$deviance[2]
## [1] NA
summary_first$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.6227052614 2.910635134 0.21394137 0.8305928
## KeyDateYear -0.0003062975 0.001441371 -0.21250433 0.8317136
## DistrictCd -0.0001024754 0.001120495 -0.09145546 0.9271307
summary_second <- summary(second_poisson_model)
summary_second
##
## Call:
## glm(formula = NumberOfDogs ~ KeyDateYear + DistrictCd, family = poisson,
## data = sum_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -81.666277 2.925105 -27.92 <2e-16 ***
## KeyDateYear 0.043686 0.001448 30.16 <2e-16 ***
## DistrictCd -0.009640 0.000368 -26.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 24077 on 110 degrees of freedom
## Residual deviance: 22173 on 108 degrees of freedom
## AIC: 23067
##
## Number of Fisher Scoring iterations: 11
summary_second$deviance[1]
## [1] 22172.84
summary_second$deviance[2]
## [1] NA
summary_second$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -81.666277212 2.9251048123 -27.91909 1.564796e-171
## KeyDateYear 0.043686187 0.0014485481 30.15860 8.272900e-200
## DistrictCd -0.009639715 0.0003679734 -26.19678 2.891998e-151
There is a large difference between the null deviance (24,077) and the residual deviance (22,173), so the predictors improve the model's fit. The AIC (23,067) is lower than the first model's (142,281), indicating a better fit. The second model's coefficients are statistically significant (p < 0.001) and show an increase in registrations per year (KeyDateYear = 0.044). Because this is a Poisson model, the negative district coefficient must be exponentiated to be interpreted:
exp(District coefficient) = exp(−0.009640) ≈ 0.9904
This is a multiplicative rate ratio: for every one-unit increase in DistrictCd (i.e. moving from one district code to the next), the predicted number of dog registrations decreases by about 0.96%.
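The arithmetic behind this interpretation can be checked directly:

```r
# Exponentiating a Poisson coefficient gives the multiplicative rate ratio
# per one-unit increase in the predictor.
rate_ratio <- exp(-0.009640)
pct_decrease <- (1 - rate_ratio) * 100
round(rate_ratio, 4)    # ~0.9904
round(pct_decrease, 2)  # ~0.96 (percent decrease per district unit)
```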
As a second implementation of generalized linear models, we return to the previously introduced research question: how do dog counts evolve over time? We present the following interpretation of the GLM.
The coefficient for the yearly growth rate of dog counts is positive and statistically significant, indicating an increasing trend in dog ownership over time. The district coefficients show that Districts 11 (1202), 7 (974) and 9 (853) have the largest dog counts relative to the baseline, while District 15 is the only one below it (about 99 fewer dogs). From the visualization, District 11 in particular shows an uptick in dog numbers from 2020, perhaps as people worked from home more during the Covid-19 pandemic.
Regarding the model’s statistics, the model has a strong fit. We see that thanks to the following:
Pet Paradise can focus on higher growth districts, such as District 11, District 7 and District 9.
We introduce our binomial GLM section by defining the following goal: predicting a dog's sex based on its age. To work with such models, the response variable must first be transformed to a 0-1 response, as it is originally coded as 1 for male and 2 for female. Our objective is to assess whether there is any significant relationship between a dog's age and the likelihood of it being male or female.
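A minimal sketch of that recode, assuming DogSexCd is coded 1 = male and 2 = female as in the register:

```r
# Subtracting 1 yields 0 = male, 1 = female, the 0-1 response
# that the binomial family expects.
DogSexCd <- c(1, 2, 2, 1)
DogSex01 <- DogSexCd - 1
DogSex01
# 0 1 1 0
```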
glm_dog_sex_age <- glm(DogSexCd ~ DogAgeGroupCd, family = binomial, data = df_EN)
summary(glm_dog_sex_age)
##
## Call:
## glm(formula = DogSexCd ~ DogAgeGroupCd, family = binomial, data = df_EN)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.016374 0.010949 -1.495 0.1348
## DogAgeGroupCd 0.003459 0.001351 2.561 0.0104 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 98381 on 70966 degrees of freedom
## Residual deviance: 98368 on 70965 degrees of freedom
## AIC: 98372
##
## Number of Fisher Scoring iterations: 4
exp_coef <- exp(coef(glm_dog_sex_age))
percentage_change <- (exp_coef - 1) * 100
percentage_change
## (Intercept) DogAgeGroupCd
## -1.6240642 0.3464974
The coefficient for DogAgeGroupCd is 0.0035 with a p-value of 0.0104, indicating statistical significance. For every unit increase in the dog age code, the odds of a dog being female increase by about 0.35%, holding other variables constant. While this statistically significant coefficient provides evidence of a relationship between dog age and the probability of being female, the practical significance of a 0.35% increase in odds may not be substantial enough to warrant immediate business decisions based solely on this finding.
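To make the size of this effect concrete, the log-odds from the summary can be converted into predicted probabilities with the inverse-logit function. A minimal sketch using the reported coefficients (the age values are illustrative):

```r
# Convert the model's log-odds into predicted probabilities of "female",
# using the coefficients reported in the summary above.
b0 <- -0.016374   # intercept: log-odds at age code 0
b1 <-  0.003459   # change in log-odds per unit of DogAgeGroupCd

ages <- c(0, 5, 10, 15)
prob_female <- plogis(b0 + b1 * ages)   # inverse-logit
round(prob_female, 3)
# Probabilities hover just around 0.5 across this range, which is why the
# effect, while statistically significant, is practically negligible.
```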
Further analysis may be needed here. Although the hypothesis was to offer products tailored to a dog’s age and sex, and while age appears to be weakly associated with sex, additional factors such as breed and size would support sounder business decisions. More investigation into such factors would refine predictions, so we turn to dog breeds next.
| Breed | Unknown | Chihuahua | Labrador Retriever | Yorkshire Terrier | Jack Russel Terrier |
|---|---|---|---|---|---|
| Count | 9095 | 4828 | 4198 | 2709 | 2579 |
The most common category is “unknown”, i.e. mixed-breed dogs. Pet Paradise should therefore offer mixed-breed foods and products rather than purebred offerings alone.
This suggests another sales approach. Suppose Pet Paradise wants to target specific dog or owner age groups for marketing or sales purposes: for example, predicting the likelihood that a pet owner in their 40s owns a top-5 breed (e.g., the Chihuahua) rather than an unknown breed. A binomial logistic regression is the natural choice here, because the response variable is binary: either a pet owner owns a Chihuahua (coded as 1) or they own an unknown breed (coded as 0).
In terms of identifying popular breeds based on age and sex, Pet Paradise wants to use a model’s predictions to optimize inventory management by stocking up on products that are likely to be in higher demand based on the popularity of a given pedigree breed.
chihuahua_binomial <- glm(ChihuahuaOwned ~ 1, family = binomial, data = owner_40s_df)
summary(chihuahua_binomial)
##
## Call:
## glm(formula = ChihuahuaOwned ~ 1, family = binomial, data = owner_40s_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.29617 0.03527 -36.75 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4960.2 on 4766 degrees of freedom
## Residual deviance: 4960.2 on 4766 degrees of freedom
## AIC: 4962.2
##
## Number of Fisher Scoring iterations: 4
chihuahua_exp_coef <- exp(coef(chihuahua_binomial))
chihuahua_exp_coef
## (Intercept)
## 0.2735773
chihuahua_percentage_change <- (chihuahua_exp_coef - 1) * 100
chihuahua_percentage_change
## (Intercept)
## -72.64227
We see that the log-odds of a dog owner in their 40s owning a Chihuahua are -1.30, which is statistically significant. The exponentiated coefficient is 0.27, meaning the odds of such an owner having a Chihuahua are about 0.27 to 1, corresponding to a probability of roughly 21%. If Pet Paradise targets typically well-earning professionals, i.e., adults in their 40s, then since the likelihood of an owner in their 40s owning a Chihuahua is relatively low, Pet Paradise should diversify its marketing efforts away from a pedigree focus and instead highlight a broader range of mixed-breed foods, fur shampoos and other products. This strategy will help attract well-earning customers who may own mixed or other breeds.
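Since this is an intercept-only model, its exponentiated coefficient is simply the sample odds of ownership, which can be converted to a probability. A minimal sketch using the value reported above:

```r
# The intercept-only model's exponentiated coefficient equals the sample
# odds of Chihuahua ownership; converting those odds to a probability:
odds <- 0.2735773          # exp(coef) from the output above
prob <- odds / (1 + odds)
round(prob, 3)             # about 0.215, i.e. roughly a 21.5% probability
```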
To check a last potential business case, we reverse the question and predict which owner age group is more likely to own a Chihuahua. We again have a binary response (ChihuahuaOwned: 1 for ownership, 0 for no ownership) and use the remaining predictor variables (PrimaryBreed, DogAgeGroupCd, DogSexCd, and OwnerAgeGroupCd).
chihuahua_age_bracket <- glm(formula = ChihuahuaOwned ~ PrimaryBreed + DogAgeGroupCd + DogSexCd + OwnerAgeGroupCd, family = binomial, data = owner_40s_df)
summary(chihuahua_age_bracket)
# The model did not converge, which points to collinearity; we compute variance inflation factors
vif_values <- car::vif(chihuahua_age_bracket)
vif_values
The VIF output from the code above shows that the model suffers from perfect multicollinearity, i.e. exact linear dependencies among the predictor variables. When categorical variables have many levels, such dependencies lead to unreliable estimates. So if Pet Paradise were to predict which customers want to buy a product based on their age bracket, dog breed, and so on, a model with these singularities might suggest targeting certain groups of customers when, in fact, the data is too ambiguous to make such recommendations confidently. We therefore compare the first two of the three models to determine the better fit instead.
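The effect of perfect multicollinearity can be reproduced in miniature. In the hypothetical example below (simulated data, not the report’s), one predictor is an exact linear function of another, and glm() responds by aliasing its coefficient:

```r
# Minimal illustration of perfect collinearity: x2 is an exact linear
# function of x1, so glm() drops it and reports its coefficient as NA.
set.seed(1)
x1 <- rnorm(100)
x2 <- 2 * x1                         # linearly dependent predictor
y  <- rbinom(100, 1, plogis(x1))
m  <- glm(y ~ x1 + x2, family = binomial)
coef(m)                              # the x2 entry is NA (aliased)
```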
cat("Model 1 - Dog Sex vs. Age:\n")
## Model 1 - Dog Sex vs. Age:
cat("AIC:", aic_model1, "\n")
## AIC: 98372.3
cat("BIC:", bic_model1, "\n\n")
## BIC: 98390.64
cat("Model 2 - Chihuahua Ownership in 40s:\n")
## Model 2 - Chihuahua Ownership in 40s:
cat("AIC:", aic_model2, "\n")
## AIC: 4962.163
cat("BIC:", bic_model2, "\n")
## BIC: 4968.632
We see from the Akaike Information Criterion and Bayesian Information Criterion that the second model, i.e. Chihuahua ownership in an owner’s 40s, has much lower AIC and BIC values than the dog sex vs. age model, so model 2 provides a better fit to its data (though the two models are fitted to different responses and datasets, so this comparison is only indicative). Because each model contains fewer than two predictors, checking for collinearity is not meaningful. We check the confusion matrices instead.
# Confusion matrix for Dog Sex vs. Age
predicted_probabilities <- predict(glm_dog_sex_age, type = "response")
predicted_classes <- ifelse(predicted_probabilities > 0.5, 1, 0)
confusion_glm1 <- table(df_EN$DogSexCd, predicted_classes)
confusion_glm1
## predicted_classes
## 0 1
## 0 15813 19596
## 1 14672 20886
The first confusion matrix shows 15,813 correctly predicted negatives (true negatives), 19,596 incorrectly predicted positives (false positives), 14,672 false negatives, and 20,886 true positives. This means the first binomial model has an accuracy of about 52% and a precision of about 52%. Its sensitivity is around 59%, i.e. the share of all actual positive instances that the model correctly identified.
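The reported metrics follow directly from the four cells of the matrix. A minimal sketch recomputing them from the counts above:

```r
# Accuracy, precision and sensitivity derived from the confusion matrix
# above (rows = actual class, columns = predicted class).
tn <- 15813; fp <- 19596
fn <- 14672; tp <- 20886
accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)
sensitivity <- tp / (tp + fn)
round(c(accuracy = accuracy, precision = precision, sensitivity = sensitivity), 3)
# accuracy 0.517, precision 0.516, sensitivity 0.587
```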
The client question for this analysis was to develop and compare binomial logistic regression models to predict outcomes regarding dog ownership from the given variables. We examined two main models: one predicting the sex of the dog based on its age group, and another predicting ownership of the most popular pedigree breed among dog owners in their prime earning years, their 40s.
The first model offered insights into the relationship between dog age and sex but had limited predictive power. The second model provided a baseline probability of pedigree ownership, using a constructed binary ownership variable to target a very specific demographic niche. When attempting to include multiple predictors in a third pedigree-ownership model, perfect multicollinearity was detected, resulting in unstable estimates and convergence difficulties. For Pet Paradise, targeting broader product ranges at well-earning professionals, rather than narrowly focusing on pedigree breeds, could be more effective.
To advise Pet Paradise on the question of how popular dog breeds evolve, we employ Generalized Additive Models (GAMs). Such models can capture the non-linear patterns inherent in the popularity fluctuations of various dog breeds over the years. This analysis will enable Pet Paradise to make informed predictions about future trends, facilitating their ability to anticipate demand for breed-specific products.
First we have a look at all districts together.
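The summary below corresponds to a model of the following form. As a self-contained sketch we fit the same formula to simulated data (the breed names match the report, but the counts are invented), so the numbers will differ from the real output:

```r
library(mgcv)
set.seed(42)
# Illustrative data: 9 years x 5 breeds = 45 rows, matching n = 45 below.
sim <- expand.grid(
  KeyDateYear  = 2015:2023,
  PrimaryBreed = factor(c("Chihuahua", "Jack Russel Terrier",
                          "Labrador Retriever", "Malteser",
                          "Yorkshire Terrier"))
)
sim$BreedCount <- rpois(nrow(sim), lambda = 100 + 8 * (sim$KeyDateYear - 2015))

# Tensor-product smooth of year, separate per breed, plus a random-effect
# smooth for breed (same formula as in the summary below).
gam_breeds <- gam(BreedCount ~ te(KeyDateYear, by = PrimaryBreed, k = 8) +
                    s(PrimaryBreed, bs = "re"),
                  family = poisson, method = "REML", data = sim)
```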
##
## Family: poisson
## Link function: log
##
## Formula:
## BreedCount ~ te(KeyDateYear, by = PrimaryBreed, k = 8) + s(PrimaryBreed,
## bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.8495 0.1577 37.09 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## te(KeyDateYear):PrimaryBreedChihuahua 2.324 2.877 12.554 0.00638 **
## te(KeyDateYear):PrimaryBreedJack Russel Terrier 1.000 1.000 0.513 0.47390
## te(KeyDateYear):PrimaryBreedLabrador Retriever 2.647 3.263 37.604 < 2e-16 ***
## te(KeyDateYear):PrimaryBreedMalteser 1.972 2.448 52.404 < 2e-16 ***
## te(KeyDateYear):PrimaryBreedYorkshire Terrier 1.000 1.000 3.371 0.06637 .
## s(PrimaryBreed) 3.989 4.000 1597.524 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.994 Deviance explained = 99.6%
## -REML = 209.97 Scale est. = 1 n = 45
The GAM visualization displays the evolving popularity of various dog breeds over time and makes the non-linear patterns in breed popularity fluctuations apparent. The model, built with a Poisson family and a log link function, predicts breed counts by incorporating a tensor product smooth term for time (KeyDateYear) by breed (PrimaryBreed), along with a random-effect smooth term for breed alone. The coefficients derived from the model highlight significant associations between time and the popularity of specific breeds: breeds like the Labrador Retriever and Malteser show noticeable patterns in popularity over time, as indicated by their significant smooth terms. Finally, the model’s high adjusted R-squared value highlights its good fit in explaining the data variability.
Next we direct our attention to each individual city district.
## 'data.frame': 540 obs. of 4 variables:
## $ KeyDateYear : num 2015 2015 2015 2015 2015 ...
## $ PrimaryBreed: Factor w/ 394 levels "$Labradoodle$",..: 94 94 94 94 94 94 94 94 94 94 ...
## $ DistrictSort: Factor w/ 12 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ BreedCount : int 10 37 59 28 10 20 25 15 76 40 ...
##
## Family: poisson
## Link function: log
##
## Formula:
## BreedCount ~ s(KeyDateYear, bs = "cr", k = 5) + s(PrimaryBreed,
## bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.5962 0.1788 8.926 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(KeyDateYear) 1.00 1.001 2.144 0.143
## s(PrimaryBreed) 3.42 4.000 32.862 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.822 Deviance explained = 81.9%
## -REML = 89.723 Scale est. = 1 n = 45
From the summary, we see that the dataset comprises 540 observations across 12 district levels, tracking the year, the primary dog breed, the district, and the breed count. This GAM also uses a smooth term for time, here a cubic regression spline with complexity parameter k set to 5, alongside a random-effect smooth term for the primary dog breed. The smooth terms show a significant relationship between breed and popularity, while the time effect is only marginal. The model’s adjusted R-squared indicates it explains slightly less of the data variability, at 82%, but it is still a good fit. For our client, Pet Paradise, this local analysis offers insight into district-specific trends in dog breed popularity; they can target breeds like the Labrador and Chihuahua in particular, with marketing tailored to those dogs’ needs.
For the second part of the Generalized Additive Model chapter we again direct our attention to the previously posed research question: how do dog counts evolve over time?
As a straightforward implementation of the GAM, we set out to produce a regression of the registered count data, grouped by city district, analyze its evolution over time, and additionally provide predictions for the next 10 years.
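The model shown below, and the 10-year forecast, can be sketched as follows on simulated data (the district counts are invented; the formula mirrors the one in the output):

```r
library(mgcv)
set.seed(7)
# Illustrative data: yearly dog counts per district, fitted with a
# cubic-spline GAM, then a 10-year forecast for one district.
sim <- expand.grid(KeyDateYear = 2015:2023, DistrictSort = factor(1:12))
sim$TotalDogs <- rpois(nrow(sim), lambda = 400 + 15 * (sim$KeyDateYear - 2015))

m_dogs <- gam(TotalDogs ~ s(KeyDateYear, bs = "cr", k = 4) +
                s(DistrictSort, bs = "re"),
              family = poisson, method = "REML", data = sim)

future <- data.frame(KeyDateYear = 2024:2033,
                     DistrictSort = factor(1, levels = 1:12))
forecast <- predict(m_dogs, newdata = future, type = "response")
round(forecast)
```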
##
## Family: poisson
## Link function: log
##
## Formula:
## TotalDogs ~ s(KeyDateYear, bs = "cr", k = 4) + s(DistrictSort,
## bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.8853 0.4767 12.35 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(KeyDateYear) 2.719 2.939 911.6 <2e-16 ***
## s(DistrictSort) 11.948 12.000 16051.0 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.996 Deviance explained = 99.7%
## -REML = 555.44 Scale est. = 1 n = 111
We now proceed to further optimize the GAM model. The smooth term’s complexity parameter is raised to k = 8, which allows a more detailed fitted representation of change over time. Despite the deviance explained dropping slightly from 99.7% to 99.6%, the model shows a high level of significance for both smooth terms and explains a large portion of the data variability.
##
## Family: poisson
## Link function: log
##
## Formula:
## TotalDogs ~ s(KeyDateYear, bs = "cr", k = 8) + s(DistrictSort,
## bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.322 0.189 33.46 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(KeyDateYear) 4.053 4.881 913 <2e-16 ***
## s(DistrictSort) 10.994 11.000 15837 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.996 Deviance explained = 99.6%
## -REML = 538.2 Scale est. = 1 n = 108
This refined GAM model also uses a Poisson distribution with a log link function, combining the smooth term for time with a random-effect smooth term for the district (DistrictSort). Again, there are positive associations between dog counts and both time and district, as seen in the visualization, showing the noticeable impact of time and district on the growth of dog populations. Our recommendation to the client is that over the next 10 years Pet Paradise can expect positive dog population growth, and should expand its business accordingly to meet dog owners’ demands and needs in each district.
We now draw some comparisons between the above two models, as well as the GLM that was introduced earlier.
The residuals versus fitted values scatterplot appears free of patterns, indicating that the models have captured most of the data variation, although the GLM has the most outliers while the simple GAM appears to have the fewest. The visualization shows homoscedasticity, i.e. a consistent spread, which is good. All models are satisfactory for this business question.
aic_comparison <- AIC(gam_model_simple, gam_model_refined)
aic_comparison
## df AIC
## gam_model_simple 15.88798 999.1300
## gam_model_refined 16.88099 987.7553
The lower AIC value of the refined GAM confirms its improved fit.
We start this chapter by posing the research question: can we predict whether a dog is of pure or mixed breed, based on its location and on owner and dog characteristics?
An artificial neural network (ANN) is the algorithm of choice for this assessment due to its capability to handle complex, non-linear relationships within the dataset. Dogs’ breed may depend on various interacting factors, such as age, size, location, and owner demographics, which ANNs can effectively capture and analyze.
Their scalability and adaptability make them robust for ongoing studies, while their multilayered learning allows the model to automatically identify the most significant features. We ultimately aim to accurately predict breed status and provide valuable insights for our suggested business project, as well as pet adoption agencies, veterinarians, or even urban planners of the city of Zurich.
Continuing with the development of models, and to answer the above, we now implement an Artificial Neural Network using the packages nnet and caret in R. The model produces a classification for the dependent variable from predictors of both numerical and categorical type.
The target variable MixedBreed is of categorical type
(factor in R), indicating the dog’s pedigree status, with 4 different
possible responses, from pure breed to 3 different descriptors of breed
mixing. For the sake of simplicity, the levels within the
MixedBreed factor variable have been reduced to a binary
response, indicating whether the dog is of pure pedigree or not. We
consider that this response better fits the information needed for our
business case.
As explained above, the chosen predictor variables pertain to characteristics of the dog owners (their age, sex, and district), as well as some characteristics of the dogs (age and sex). The predictors are the following: OwnerAgeGroupCd, OwnerSexCd, DistrictCd, DogAgeGroupCd and DogSexCd. Although the model results suggest that incorporating additional predictors would improve accuracy, we have kept the current selection for learning purposes.
The libraries of choice are nnet and caret, the first providing the functions for creating and training the model, the latter providing an interface for further preprocessing and training. This allowed us to implement a 10-fold cross-validation during the training phase of model building, and to establish a tuning grid of candidate hyperparameters to test different combinations and find the optimal ones.
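The cross-validation and tuning-grid setup described above can be sketched with caret as follows. We use simulated data here, whereas the report trains on df_EN with the five owner/dog predictors:

```r
library(caret)
set.seed(1)
# Illustrative data standing in for the report's df_EN.
sim <- data.frame(x1 = rnorm(400), x2 = rnorm(400))
sim$class <- factor(ifelse(sim$x1 + rnorm(400) > 0, "Mixed", "Pedigree"))

ctrl <- trainControl(method = "cv", number = 10)          # 10-fold CV
grid <- expand.grid(size  = c(5, 10, 15),                 # hidden units
                    decay = c(1e-4, 1e-3, 1e-2))          # weight decay
breed_net_sim <- train(class ~ x1 + x2, data = sim, method = "nnet",
                       trControl = ctrl, tuneGrid = grid, trace = FALSE)
breed_net_sim$bestTune   # best size/decay combination under CV accuracy
```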
breed_net
## Neural Network
##
## 56774 samples
## 5 predictor
## 2 classes: 'Mixed breed', 'Pedigree dog'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 51096, 51096, 51097, 51097, 51096, 51097, ...
## Resampling results across tuning parameters:
##
## size decay Accuracy Kappa
## 5 1e-04 0.7144468 0.0004926028
## 5 1e-03 0.7145877 0.0005098842
## 5 1e-02 0.7142530 0.0014752396
## 10 1e-04 0.7144468 0.0023939219
## 10 1e-03 0.7143235 0.0046166926
## 10 1e-02 0.7146406 0.0059293934
## 15 1e-04 0.7138479 0.0035546704
## 15 1e-03 0.7146581 0.0088259620
## 15 1e-02 0.7144467 0.0082564278
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were size = 15 and decay = 0.001.
breed_net$finalModel
## a 16-15-1 network with 271 weights
## inputs: OwnerAgeGroupCd OwnerSexmale `DistrictKreis 10` `DistrictKreis 11` `DistrictKreis 12` `DistrictKreis 2` `DistrictKreis 3` `DistrictKreis 4` `DistrictKreis 5` `DistrictKreis 6` `DistrictKreis 7` `DistrictKreis 8` `DistrictKreis 9` `DistrictUnknown (Stadt Zürich)` DogAgeGroupCd DogSexmale
## output(s): .outcome
## options were - entropy fitting decay=0.001
After running the code, the final model is a neural network with 16 input nodes, 15 hidden nodes, and 1 output node, totalling 271 weights. However, an evaluation of the confusion matrix and ROC curves indicates insufficient evidence to support the model’s validity for the given variables. This issue might stem from the selection of variables; expanding the set to include more variables from the dataset might improve the model. Alternatively, it could be that an artificial neural network is not the most suitable model for addressing this particular research question.
After building the model on an 80% training subset of the total dataset, we use it to predict values for the remaining 20%. We evaluate the model’s performance using a confusion matrix, which allows us to compare our predicted values with the actual values of the test subset. The model correctly classifies 10,098 dogs as pure-bred, with only 44 misclassified. However, this good result is overshadowed by the misclassification of 4,020 mixed-breed dogs, with only 31 correct predictions. This suggests that the model wrongly tends to classify most dogs as pure-bred.
The ROC curve lies close to the diagonal, which suggests the model’s predictions are not accurate and supports the need either to revisit the model with more predictors or to consider a different model for the question at hand altogether.
In the last chapter of our series on machine learning models we aim to predict the age group of dog owners based on the ages of their dogs at the time of registration, for which we implement a Support Vector Machine (SVM). As a preliminary step, we note that some observations in both variables carry the numerical value 999, meaning “unknown”. In order not to exclude these points from the training and testing subsets, we assign them the mean value of the rest of the set.
As a prelude to model building we can look at the above plot, which portrays the relationship between the relevant response and predictor variables. The size of the dots reflects the frequency of observations falling under each combination of age groups. The mean-imputed unknown observations stand out for not fitting onto the grid, reflecting their non-integer values against the original integers. Keeping them with their decimals was preferred, as rounding them to the multiple of 10 above or below was considered too great an alteration of their values.
What becomes visible at a first glance is the concentration of values around the area corresponding to younger dogs and owners aged 30-40. We can expect more precise predictions around this cluster based on this assumption.
For model evaluation, these numbers are cast back to integers, and then to factors, to produce interpretable confusion matrices. It is worth noting that these preparation measures could be undertaken in many different ways, with potential improvements to the models. We leave such considerations for future iterations of the research project and concentrate on developing the models with the main purpose of becoming acquainted with the relevant tools and evaluation methods.
With the help of the caret framework we implement model
building functions from the e1071 library, along with easy
tuning and training control methods. The data is initially sampled into
proportional 70% training and 30% testing sets. Additionally, a 10-fold
cross-validation optimization procedure is implemented, with 3
repetitions.
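The training setup can be sketched as follows on simulated data. Note that caret’s "svmLinear" method wraps kernlab, while "svmLinear2" would use e1071 as mentioned above; the exact method code used in the report is an assumption here:

```r
library(caret)
set.seed(2)
# Illustrative data: owner age regressed on dog age, as in the report.
sim <- data.frame(DogAge = sample(0:20, 300, replace = TRUE))
sim$OwnerAge <- 35 + 1.2 * sim$DogAge + rnorm(300, sd = 12)

idx   <- createDataPartition(sim$OwnerAge, p = 0.7, list = FALSE)  # 70/30 split
ctrl  <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
svm_l <- train(OwnerAge ~ DogAge, data = sim[idx, ],
               method = "svmLinear", trControl = ctrl)
svm_l$results   # RMSE, Rsquared and MAE under cross-validation
```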
svm_linear
## Support Vector Machines with Linear Kernel
##
## 49673 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 44705, 44707, 44705, 44705, 44705, 44705, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 15.07558 0.09038973 12.39084
##
## Tuning parameter 'C' was held constant at a value of 1
The chosen model with linear kernel function is assigned a cost parameter of 1. We now go ahead and look at the results.
The predicted results have been adjusted and rounded to their nearest multiple of 10, in order to be able to visualize them against the actual decade time ranges represented in the original data. The following confusion plot shows the frequency of guesses for each age range.
The above plot shows that all predictions fall exclusively within the 40-60 owner age ranges, with most guesses corresponding to 40. Although the results are relatively poor, the model does capture the main owner demographic of owners aged 30 to 40.
confMa_linear
## Confusion Matrix and Statistics
##
## Reference
## Prediction 10 20 30 40 50 60 70 80 90
## 10 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0
## 30 0 0 0 0 0 0 0 0 0
## 40 58 1566 3175 2789 2359 1469 774 151 13
## 50 0 383 1522 1625 2143 1483 1137 404 45
## 60 0 0 17 34 48 45 29 17 3
## 70 0 0 0 0 0 0 0 0 0
## 80 0 0 0 0 0 0 0 0 0
## 90 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.2338
## 95% CI : (0.2281, 0.2395)
## No Information Rate : 0.2214
## P-Value [Acc > NIR] : 8.294e-06
##
## Kappa : 0.0298
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 10 Class: 20 Class: 30 Class: 40 Class: 50 Class: 60 Class: 70 Class: 80
## Sensitivity 0.000000 0.00000 0.0000 0.6270 0.4710 0.015015 0.00000 0.00000
## Specificity 1.000000 1.00000 1.0000 0.4320 0.6058 0.991909 1.00000 1.00000
## Pos Pred Value NaN NaN NaN 0.2258 0.2451 0.233161 NaN NaN
## Neg Pred Value 0.997276 0.90845 0.7786 0.8143 0.8082 0.860068 0.90887 0.97313
## Prevalence 0.002724 0.09155 0.2214 0.2089 0.2137 0.140777 0.09113 0.02687
## Detection Rate 0.000000 0.00000 0.0000 0.1310 0.1007 0.002114 0.00000 0.00000
## Detection Prevalence 0.000000 0.00000 0.0000 0.5803 0.4106 0.009066 0.00000 0.00000
## Balanced Accuracy 0.500000 0.50000 0.5000 0.5295 0.5384 0.503462 0.50000 0.50000
## Class: 90
## Sensitivity 0.000000
## Specificity 1.000000
## Pos Pred Value NaN
## Neg Pred Value 0.997135
## Prevalence 0.002865
## Detection Rate 0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy 0.500000
A closer look at the confusion matrix shows the accuracy is quite low, at 23%. We opt to change the kernel function, as it may capture additional relevant patterns in the data.
svm_rbf
## Support Vector Machines with Radial Basis Function Kernel
##
## 49678 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 44711, 44710, 44708, 44710, 44711, 44710, ...
## Resampling results across tuning parameters:
##
## C RMSE Rsquared MAE
## 0.25 15.12327 0.08196042 12.38584
## 0.50 15.12376 0.08191171 12.38624
## 1.00 15.12445 0.08184675 12.38678
## 2.00 15.12475 0.08181964 12.38697
## 4.00 15.12485 0.08180993 12.38701
## 8.00 15.12518 0.08177908 12.38716
## 16.00 15.12528 0.08177042 12.38727
## 32.00 15.12531 0.08176758 12.38733
## 64.00 15.12531 0.08176760 12.38734
## 128.00 15.12529 0.08177019 12.38737
##
## Tuning parameter 'sigma' was held constant at a value of 8.77719
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 8.77719 and C = 0.25.
The radial kernel SVM model is built with similar characteristics to the linear one, with a sigma parameter valued at 8.77.
The confusion matrix in this case follows an almost identical pattern, with very slight differences in frequencies. It again is limited to the 40-60 age ranges, and captures most intensely the most populated demographic sector.
confMa_radial
## Confusion Matrix and Statistics
##
## Reference
## Prediction 10 20 30 40 50 60 70 80 90
## 10 0 0 0 0 0 0 0 0 0
## 20 0 0 0 0 0 0 0 0 0
## 30 0 0 0 0 0 0 0 0 0
## 40 58 1566 3175 2789 2359 1469 774 151 13
## 50 0 383 1522 1625 2143 1483 1137 404 45
## 60 0 0 17 34 48 45 29 17 3
## 70 0 0 0 0 0 0 0 0 0
## 80 0 0 0 0 0 0 0 0 0
## 90 0 0 0 0 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.2338
## 95% CI : (0.2281, 0.2395)
## No Information Rate : 0.2214
## P-Value [Acc > NIR] : 8.294e-06
##
## Kappa : 0.0298
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: 10 Class: 20 Class: 30 Class: 40 Class: 50 Class: 60 Class: 70 Class: 80
## Sensitivity 0.000000 0.00000 0.0000 0.6270 0.4710 0.015015 0.00000 0.00000
## Specificity 1.000000 1.00000 1.0000 0.4320 0.6058 0.991909 1.00000 1.00000
## Pos Pred Value NaN NaN NaN 0.2258 0.2451 0.233161 NaN NaN
## Neg Pred Value 0.997276 0.90845 0.7786 0.8143 0.8082 0.860068 0.90887 0.97313
## Prevalence 0.002724 0.09155 0.2214 0.2089 0.2137 0.140777 0.09113 0.02687
## Detection Rate 0.000000 0.00000 0.0000 0.1310 0.1007 0.002114 0.00000 0.00000
## Detection Prevalence 0.000000 0.00000 0.0000 0.5803 0.4106 0.009066 0.00000 0.00000
## Balanced Accuracy 0.500000 0.50000 0.5000 0.5295 0.5384 0.503462 0.50000 0.50000
## Class: 90
## Sensitivity 0.000000
## Specificity 1.000000
## Pos Pred Value NaN
## Neg Pred Value 0.997135
## Prevalence 0.002865
## Detection Rate 0.000000
## Detection Prevalence 0.000000
## Balanced Accuracy 0.500000
The results from the confusion matrix of the second model show an identical accuracy of 23%. Based on both results, further development of the study could expand the predictor set beyond dog age alone and revisit the handling of the mean-imputed “unknown” observations.
A noteworthy addition to the construction and testing of diverse machine learning models pertains to the methodology employed, particularly within the R programming environment. Given that our data science team consists of three members from different cultural and professional backgrounds, we prioritized making the dataset accessible to everyone. This led us to dedicate a chapter in the early phases of the project to translating the dataset into English.
In today’s dynamic field of data science, our experience has shown us that R is the go-to language for machine learning and statistical analysis, while Python dominates a wider array of applications. As Python enthusiasts, crafting an efficient yet straightforward translation script in R seemed daunting at first, pushing us beyond our comfort zone. This experience turned into an intricate exercise, allowing us to deepen our fluency in R through practical application of code patterns like conditional statements, loops, and function declarations. It became a valuable opportunity to expand our skills and reinforce our understanding of R’s unique capabilities.
We will now provide a brief explanation of the logic and implementation of this translation script. The translation focused on names of columns and key patterns found within the categorical responses. These were defined manually by the members of the team able to work in German.
A simple function declaration looked for patterns in the text components of the variable responses, that could seamlessly be translated to English without minute interventions.
library(stringr)  # provides str_replace_all()

replace_patterns <- function(text, patterns, replacements) {
  for (i in seq_along(patterns)) {
    text <- str_replace_all(text, patterns[i], replacements[i])
  }
  return(text)
}
Such patterns would take the following form:
patterns <- c("- bis ", "-Jährige", "männlich", "weiblich", ...)
replacements <- c(" to ", " years old", "male", "female", ...)
The only remaining complex task was replacing the German dog colour descriptions with their English equivalents:
color_patterns <- c("schwarz", "braun", "weiss", "grau", ...)
color_replacements <- c("black", "brown", "white", "gray", ...)
df_EN$DogColor <- replace_patterns(df_EN$DogColor, color_patterns, color_replacements)
Lastly, dog breeds were deliberately chosen to be left in their original form, given their complex descriptive nature, as well as the vast dimensionality of that variable.
A noteworthy aspect of our project was the emphasis on integrating data-driven research with a clean and sophisticated user-friendly interface. To achieve this, we extensively utilized the Shiny Apps framework, particularly in the Exploratory Data Analysis section of our report. This focus was crucial for directing us towards a correct understanding of the data and its underlying patterns, enabling us to formulate the right questions and match them with the most appropriate tools.
In addition to Shiny Apps, many of the plots generated with the ggplot2 library were made interactive through the implementation of the Plotly framework. This interactivity enhanced the analytical capabilities and accessibility of our visualizations, making them more informative and engaging for users.
A significant challenge we faced was producing an efficient RMarkdown script and integrating chapters developed by individual team members. Our collaboration platform of choice was GitHub, where we maintained an organized and intuitive directory structure. To streamline the knitting process, we cached some of the most complex models and elements manually, storing them efficiently within the agreed folder structure. This approach minimized computational load and improved the reproducibility of our work.
We believe a successful data science project should balance robust data interpretation and coding with accessible, user-friendly presentation platforms for non-technical clients and stakeholders. Just as with machine learning methods, the applications we develop should aim for maximum reproducibility and intuitiveness, ensuring they are as effective and user-friendly as possible.
To close, we now reflect on the insights obtained during the project's development as well as the challenges encountered.
Age & Neighborhood
Several of the presented statistical models indicate that districts 11, 7 and 9 lead the dog demographics charts and maintain steady, ever-developing growth, parallel to their human population's development. It is no coincidence that these are sectors of the city that have experienced recent urban development and an influx of younger families. We expect this growth to continue in the coming years, leading us to suggest establishing Pet Paradise branches in those locations.
These areas will also see an increasing number of registrations of younger dogs in the coming years, as a side effect of generational renewal. Younger canines, in turn, demand toys and accessories that promote physical activity and mental stimulation. We recommend a strategically curated collection of toys, agility equipment, and puzzle feeders to cater to energetic young dogs. Large dog populations are also prone to an increase in aging dogs in the coming years, for which we recommend Pet Paradise provide products such as joint supplements, vitamins, and specialized diets tailored to older dogs' needs, extending dogs' health and vitality.
Breed-specific Services
A recurrent lesson from our machine learning models pertains to the growing popularity of certain dog breeds. In particular, our GAM model clearly outlined a growing registration count of Labrador retrievers, along with Chihuahuas, Yorkshire terriers, Jack Russell terriers and Maltese dogs. The growing preference for dogs with such characteristics should inform business choices regarding breed-specific products and services.
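A GAM of this kind can be sketched with the mgcv package as follows; the data frame and column names here are assumptions for illustration, not the exact specification of our model:

```r
library(mgcv)

# Hypothetical sketch: yearly registration counts per breed modeled with
# a Poisson GAM; `registrations` is assumed to hold Year, Breed, Count.
registrations$Breed <- factor(registrations$Breed)
gam_fit <- gam(Count ~ Breed + s(Year, by = Breed),
               family = poisson(link = "log"),
               data = registrations)
summary(gam_fit)
```

The factor-smooth interaction s(Year, by = Breed) lets each breed follow its own nonlinear trend over time, which is how rising registrations for individual breeds can be read off the fitted smooths.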
We recommend devising marketing targeted at owners of dogs with special fur requirements, such as the Labrador retriever or Yorkshire terrier. High-energy breeds like Jack Russell terriers will also require special grooming and nutritional products, so offering a range of coat shampoos and grooming essentials tailored to different fur types will be of strategic importance for the brand. We believe providing breed-specific expertise can differentiate Pet Paradise and attract customers in prime locations.
The chosen dataset presented a series of particular hurdles from the beginning. Although conceptually interesting and complex in its structure, it mainly comprises categorical descriptors with varying levels of dimensionality. A notable example is the dog breed and breed-mix variable, which encompasses more than 300 possible levels. At the opposite extreme, the few original numerical variables are highly discrete in nature, consisting of integer values within narrow ranges and lacking granularity, as is the case for the age groups of both dogs and owners. This discrepancy posed interesting challenges that required more unconventional approaches.
Such characteristics in the data provided an opportunity to focus the research on interpreting grouped counts of items and their relationships with others, allowing us to devise research questions that leveraged this data quality. The effort generally followed the traditional sequence of posing a question and then identifying the appropriate method to answer it. This approach became more challenging with complex models, where the process often seemed reversed.
The most significant challenge was building complex models, specifically the artificial neural networks and support vector machines, which typically perform better with complex numeric relationships and more quantitative data. These observations support the frequently stated notion that simpler models often perform better with simpler datasets and are usually the best choice for addressing the research questions posed.
In reflection, this project underscores the efficiency and cohesion of our team's collaboration. What we initially planned as a modular effort, with each member contributing distinct elements, evolved into a collaborative process where everyone engaged in every facet of constructing the final report. One member would meticulously prepare the data and conduct the analysis, another would focus on crafting and fine-tuning a machine learning model and assessing residuals, and a third would document the entire progression and implement it into the main script.
Irrespective of the technical success of our model outcomes or the quantifiable success rates of the individual steps, we collectively emerge from this experience with a deep sense of accomplishment and satisfaction.
Recognizing that generative AI tools are here to stay as invaluable companions for present and future data scientists, we leveraged these tools for support and proofreading, significantly boosting our productivity. While AI quickly generated syntactically correct code solutions, we found that a solid understanding of the packages and general R workflows remained essential. This blend of AI assistance and deep domain knowledge allowed us to work more efficiently and effectively. Perhaps the most important aspect of working with such tools is being aware of their limitations and of the importance of prompt engineering. Correctly stating the questions, with technical remarks and references to previous code, is for now the only way of guiding the LLM toward a useful solution.
Following are examples of features and use cases:
Our primary tools for assistance were ChatGPT versions 3.5 and 4. The latter, boasting an image reader and direct CSV loader, revolutionized our troubleshooting process. This feature enabled us to interpret screenshots of sample plots and identify faulty ones, providing crucial context for prompted questions. With this capability, we could delve deeper into understanding the nuances of the issues at hand, ultimately leading to more informed decisions and efficient problem-solving.
One particularly beneficial application of generative AI coding solutions emerged when integrating code developed by different team members. Each member had worked with local versions of the dataframes, following different naming conventions. In these instances, we would copy the relevant code snippets into the LLM prompt and request the replacement of specific variable names with the standardized ones from the main RMarkdown report. We also made sure to instruct the LLM not to alter any user-made comments and annotations, ensuring that the original context and insights were preserved. This approach streamlined the integration process, making collaboration more seamless and efficient.
Another simple use case for AI-generated code arose when converting custom-coded ggplot objects into interactive plotly plots: we asked the LLM to store the object in a separate variable and add a ggplotly() call afterwards. This way, plots could be produced and tested as regular ggplot objects by team members in separate script files, before being sent to the main report script and implemented as interactive applications.
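This workflow can be written out as follows; the plot object name and aesthetic are illustrative, reusing the df_EN data frame and DogColor column from the preprocessing step:

```r
library(ggplot2)
library(plotly)

# Build and test the plot as a regular ggplot object first...
p_colors <- ggplot(df_EN, aes(x = DogColor)) +
  geom_bar() +
  labs(title = "Dog registrations by coat color")

# ...then wrap it for the interactive version in the main report.
ggplotly(p_colors)
```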
Another scenario where ChatGPT proved invaluable was during troubleshooting sessions throughout the creation of the RMarkdown report. As various team members contributed code to the main script, errors frequently interrupted the knitting process. Moreover, the rendering framework for RMarkdown lacked the detailed outputs provided by RStudio's built-in console, often offering only the location of the faulty code and a vague description of the issue. Through ChatGPT's interpretation, coupled with a copy of the problematic code snippet, pinpointing these issues became much simpler. This guidance led us to a stage where, armed with insights from ChatGPT, we could implement solutions manually, enhancing our understanding and proficiency in the process.